Cross-modal sensor data alignment

ABSTRACT

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for determining an alignment between cross-modal sensor data. In one aspect, a method comprises: obtaining (i) an image that characterizes a visual appearance of an environment, and (ii) a point cloud comprising a collection of data points that characterizes a three-dimensional geometry of the environment; processing each of a plurality of regions of the image using a visual embedding neural network to generate a respective embedding of each of the image regions; processing each of a plurality of regions of the point cloud using a shape embedding neural network to generate a respective embedding of each of the point cloud regions; and identifying a plurality of region pairs using the embeddings of the image regions and the embeddings of the point cloud regions.

CROSS-REFERENCE TO RELATED APPLICATION

This patent application is a continuation (and claims the benefit of priority under 35 USC 120) of U.S. patent application Ser. No. 16/509,152, filed Jul. 11, 2019. The disclosure of the prior application is considered part of (and is incorporated by reference in) the disclosure of this application.

BACKGROUND

This specification relates to processing cross-modal sensor data generated by camera sensors and surveying sensors.

A camera sensor can generate an image that characterizes a visual appearance of an environment. A surveying sensor (e.g., a radar or lidar sensor) can generate a point cloud that characterizes a three-dimensional (3D) geometry of an environment.

Sensor data generated by camera sensors and surveying sensors can be processed by machine learning models. Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks are deep neural networks that include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.

SUMMARY

This specification describes a system implemented as computer programs on one or more computers in one or more locations that can determine an alignment between cross-modal sensor data. Determining an alignment between two sets of sensor data (e.g., between image data and point cloud data) refers to determining a mapping between respective regions of the two sets of sensor data that characterize the same area of the environment.

According to a first aspect there is provided a method including obtaining: (i) an image, generated by a camera sensor, that characterizes a visual appearance of an environment, and (ii) a point cloud comprising a collection of data points, generated by a surveying sensor, that characterizes a three-dimensional geometry of the environment. Each data point defines a respective three-dimensional spatial position of a point on a surface in the environment. Each of multiple regions of the image are processed using a visual embedding neural network to generate a respective embedding of each of the image regions. Each of multiple regions of the point cloud are processed using a shape embedding neural network to generate a respective embedding of each of the point cloud regions. A set of region pairs are identified using the embeddings of the image regions and the embeddings of the point cloud regions. Each region pair includes a respective image region and a respective point cloud region that characterize the same respective area of the environment.

In some implementations, the surveying sensor is a lidar sensor or a radar sensor.

In some implementations, the camera sensor and the surveying sensor are mounted on a vehicle.

In some implementations, each data point in the point cloud additionally defines a strength of a reflection of a pulse of light that was transmitted by the surveying sensor and that reflected from the point on the surface of the environment at the three-dimensional spatial position defined by the data point.

In some implementations, the method further includes using the set of region pairs to determine whether the camera sensor and the surveying sensor are accurately calibrated.

In some implementations, the method further includes obtaining data defining a position of an object in the image, and identifying a corresponding position of the object in the point cloud based on: (i) the position of the object in the image, and (ii) the set of region pairs.

In some implementations, identifying the corresponding position of the object in the point cloud based on: (i) the position of the object in the image, and (ii) the set of region pairs, includes identifying particular region pairs, such that for each particular region pair, the image region of the region pair corresponds to the position of the object in the image. The position of the object in the point cloud is determined based on the respective point cloud region of each particular region pair.

In some implementations, the method further includes obtaining data defining a position of an object in the point cloud, and identifying a corresponding position of the object in the image based on: (i) the position of the object in the point cloud, and (ii) the plurality of region pairs.

In some implementations, identifying the corresponding position of the object in the image based on: (i) the position of the object in the point cloud, and (ii) the set of region pairs, includes identifying particular region pairs, such that for each particular region pair, the point cloud region of the region pair corresponds to the position of the object in the point cloud. The position of the object in the image is determined based on the respective image region of each particular region pair.

In some implementations, the method further includes projecting the point cloud onto a two-dimensional image plane that is aligned with the image using the plurality of region pairs. The image and the projected point cloud are processed using a neural network to generate a neural network output.

In some implementations, the neural network output includes data identifying positions of objects in the environment.

In some implementations, the multiple image regions cover the image.

In some implementations, the multiple point cloud regions cover the point cloud.

In some implementations, identifying the region pairs using the embeddings of the image regions and the embeddings of the point cloud regions includes identifying a set of embedding pairs, such that each given embedding pair includes the embedding of a given image region and the embedding of a given point cloud region. A respective region pair corresponding to each of the embedding pairs is identified, where the region pair corresponding to a given embedding pair includes the given image region and the given point cloud region corresponding to the given embedding pair.

In some implementations, the embedding pairs are identified based at least in part on, for each embedding pair, a respective similarity measure between the embedding of the given image region and the embedding of the given point cloud region included in the embedding pair.

In some implementations, the embedding pairs are identified using a greedy nearest neighbor matching algorithm.

In some implementations, the visual embedding neural network and the shape embedding neural network are jointly trained using a triplet loss objective function or a contrastive loss objective function.

Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages.

The alignment system described in this specification can determine an alignment between an image (i.e., generated by a camera) and a point cloud (i.e., generated by a surveying sensor, e.g., a lidar sensor). That is, the alignment system can determine a mapping between respective regions of the image and the point cloud that characterize the same area of an environment.

The alignment system can be used by an on-board system of a vehicle, e.g., for sensor calibration, cross-modal object localization, or object detection (as will be described in more detail below). By using the alignment system, the on-board system of the vehicle can generate planning decisions that plan the future trajectory of the vehicle and enable the vehicle to operate more safely and efficiently.

The alignment system can determine the alignment between an image and a point cloud using embeddings of image regions and embeddings of point cloud regions that are generated by respective embedding neural networks. By using embeddings generated by embedding neural networks that are trained using machine learning techniques, the alignment system can, in some cases, generate alignments more accurately than if it used embeddings composed of hand-crafted features (e.g., HOG, SIFT, or SURF features). In particular, the alignment system described in this specification uses region embeddings that are optimized (i.e., using machine learning techniques) to achieve accurate alignments. In contrast, embeddings composed of hand-crafted features are not optimized to achieve accurate alignments, and may therefore underperform the learned region embeddings described in this specification, e.g., by resulting in less accurate alignments.

The alignment system described in this specification can align sensor data more rapidly than some conventional systems, and in some cases, may consume fewer computational resources than conventional systems. More specifically, to align two sets of sensor data, the system described in this specification determines embeddings of respective regions of the two sets of sensor data, and then matches the embeddings, e.g., using a nearest-neighbor matching technique. In contrast, some conventional systems align two sets of sensor data by iteratively optimizing a set of parameters (e.g., rotation or transformation parameters) defining the alignment based on an objective function that characterizes how well the two sets of sensor data are aligned. Determining the alignment by iteratively optimizing an objective function can be computationally demanding, e.g., requiring several seconds or longer to align two data sets. In practical applications, e.g., in an on-board system of a vehicle, a latency of several seconds in aligning data sets may be infeasible, as the data sets may be outdated by the time they are aligned. The system described in this specification can, in some cases, determine alignments more rapidly than these conventional systems, and therefore can be effectively used in a greater number of practical applications, e.g., by an on-board system of a vehicle.

The alignment system can be used to generate training data for training an object detection neural network that is configured to process a point cloud to generate an output that identifies the positions of objects in the point cloud. In particular, object segmentations can be transferred from image data onto corresponding point cloud data using alignments generated by the alignment system. Segmentations of point cloud data generated in this manner can thereafter be used to train the object detection neural network. In some cases, manually segmenting images is substantially easier, faster, and more accurate than manually segmenting point cloud data. Therefore, the alignment system can simplify the generation of training data for training the object detection neural network by facilitating the transfer of segmentations from image data onto point cloud data.

The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an example cross-modal alignment system.

FIG. 2 is an illustration of an example of a mapping between respective regions of an image and a point cloud.

FIG. 3 is a block diagram of an example on-board system of a vehicle.

FIG. 4 illustrates an example data flow for jointly training the visual embedding neural network and the shape embedding neural network.

FIG. 5 is a flow diagram of an example process for determining a mapping between respective regions of an image and a point cloud that characterize the same area of an environment.

Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

This specification describes a cross-modal alignment system that can determine an “alignment” between cross-modal data characterizing an environment. For example, the alignment system can determine an alignment between: (i) an image (e.g., generated by a camera) that characterizes the visual appearance of the environment, and (ii) a point cloud (e.g., generated by a lidar or radar sensor) that characterizes the three-dimensional (3D) geometry of the environment. Determining an alignment between the image and the point cloud refers to determining a mapping between respective regions of the image and the point cloud that characterize the same area of the environment. Determining an alignment between the image and the point cloud may also be referred to as “registering” the image and the point cloud, or performing “scan matching” between the image and the point cloud. Cross-modal alignments generated by the alignment system can be used by an on-board system of a vehicle for any of a variety of purposes, e.g., sensor calibration, cross-modal object localization, or object detection.

Generally, the alignment system described in this specification can be used to determine alignments between any appropriate data sets. For example, the alignment system can be used to determine an “intra-modal” alignment between two data sets of the same modality, e.g., two images or two point clouds. As another example, while this specification primarily refers to the alignment system as being used by an on-board system of a vehicle, the alignment system can be used in any of a variety of other settings as well. In a particular example, the alignment system can be used to align two medical images of a patient, e.g., a magnetic resonance image (MRI) and a computed tomography (CT) image of the patient, or an ultrasound (US) image and an MRI of the patient.

These features and other features are described in more detail below.

FIG. 1 shows an example cross-modal alignment system 100. The cross-modal alignment system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented.

The system 100 is configured to process an image 102 (e.g., generated by a camera) and a point cloud 104 (e.g., generated by a lidar or radar sensor) that both characterize an environment to determine an “alignment” between the image 102 and the point cloud 104. The alignment between the image 102 and the point cloud 104 is defined by a mapping between respective regions of the image 102 and the point cloud 104 that characterize the same area of the environment. FIG. 2 illustrates an example of a mapping between respective regions of an image and a point cloud.

The image 102 characterizes the visual appearance of the environment, and may be captured using any appropriate type of digital camera sensor.

In one example, the image 102 may be a black-and-white image represented by a two-dimensional (2D) array of numerical values, where each component of the array corresponds to a respective area of the environment. In another example, the image 102 may be represented by a set of multiple “channels”. In this example, each channel may be represented by a respective 2D array of numerical values, where each component of each channel corresponds to a respective area of the environment, and corresponding components of different channels correspond to the same area of the environment. In a particular example, the image 102 may be a red-green-blue (RGB) color image represented by a red color channel, a green color channel, and a blue color channel.
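
For illustration, the channel representation can be sketched as follows (a minimal example assuming the numpy library; the array sizes and the stacked [height, width, 3] layout are illustrative assumptions, not requirements of this specification):

```python
import numpy as np

# A hypothetical 100x100 RGB image: three channels, each a 2D array of
# numerical values. Corresponding components of different channels
# correspond to the same area of the environment.
height, width = 100, 100
red_channel = np.zeros((height, width), dtype=np.float32)
green_channel = np.zeros((height, width), dtype=np.float32)
blue_channel = np.zeros((height, width), dtype=np.float32)

# One common in-memory layout stacks the channels into a single array.
image = np.stack([red_channel, green_channel, blue_channel], axis=-1)
assert image.shape == (100, 100, 3)
```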

The point cloud 104 characterizes the 3D geometry of the environment, and may be captured using any appropriate type of “surveying” sensor, e.g., a lidar sensor or a radar sensor.

Generally, the point cloud 104 is represented by a collection of “data points”, where each data point defines a 3D spatial position of a point on a surface in the environment. For example, each data point may be represented by a vector including respective x-, y-, and z-coordinates that define a 3D spatial position of a point on a surface in the environment with respect to a 3D coordinate system. In one example, the 3D coordinate system may be a Euclidean coordinate system centered on a vehicle on which the surveying sensor is mounted.

Optionally, each data point in the point cloud may include additional “intensity” information that characterizes, e.g., the reflectivity, texture, or density of the material at the 3D spatial position in the environment corresponding to the data point.

For example, the intensity information included in a data point in the point cloud may be defined by the strength of the reflection of a pulse of light that was transmitted by a lidar sensor and that reflected off a surface in the environment at the 3D spatial position corresponding to the data point. In this example, each data point in the point cloud may be represented by a vector including both: (i) respective x-, y-, and z-coordinates that define a 3D spatial position of a point on a surface in the environment, and (ii) an intensity value that defines the strength of the reflection of a pulse of light that reflected off a surface in the environment at the 3D spatial position.
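
As an illustrative sketch of this representation (assuming numpy; the specific values are fabricated placeholders):

```python
import numpy as np

# Hypothetical point cloud: one row per data point. The first three
# columns are the x-, y-, and z-coordinates of a point on a surface in
# the environment; the fourth column is the optional intensity value,
# e.g., the strength of the lidar reflection at that point.
point_cloud = np.array([
    [14.2, 3.1, 45.0, 0.82],   # [x, y, z, intensity]
    [15.7, 2.4, 44.3, 0.11],
    [17.0, 5.9, 47.6, 0.64],
], dtype=np.float32)

positions = point_cloud[:, :3]    # 3D spatial positions
intensities = point_cloud[:, 3]   # reflection strengths
```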

The system 100 includes a visual embedding neural network 106, a shape embedding neural network 108, and a matching engine 110.

The system 100 uses the visual embedding neural network 106 to generate a respective embedding 112 of each of multiple regions of the image 102. An embedding of an image region refers to a representation of the image region as an ordered collection of numerical values, e.g., a vector of numerical values. A region of an image refers to a portion of the image, e.g., that is enclosed by a square or circular 2D geometrical region. For example, a region of an RGB image with channels of dimension [100,100] (i.e., with 100 rows and 100 columns) may be given by the respective portion of each channel corresponding to pixels in the square region [42:46, 91:95] (i.e., with row index between 42 and 46, and column index between 91 and 95). An example region in the image 102 is illustrated by 114.
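
In array terms, extracting such a region amounts to slicing each channel (a sketch assuming the stacked numpy image from the earlier example; note that the region notation above is inclusive on both ends, whereas numpy slices exclude the upper bound):

```python
# Rows 42 through 46 and columns 91 through 95, inclusive, of every
# channel; the equivalent numpy slice therefore ends at 47 and 96.
region = image[42:47, 91:96, :]   # shape (5, 5, 3) for an RGB image
```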

The visual embedding neural network 106 may be configured to process the entire image 102 to generate an output that defines a respective embedding 112 of each region of the image 102. Alternatively, rather than processing the entire image 102 at once, the visual embedding neural network 106 may be configured to process individual image regions to generate respective embeddings of the image regions.

Generally, the visual embedding neural network 106 can have any appropriate neural network architecture that enables it to generate embeddings of image regions. For example, the visual embedding neural network 106 may include a set of multiple convolutional layers followed by a fully-connected output layer after the final convolutional layer.
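
A minimal sketch of such an architecture is shown below (assuming the PyTorch library; the layer widths and embedding dimension are illustrative assumptions):

```python
import torch
from torch import nn

class VisualEmbeddingNet(nn.Module):
    """Sketch of a visual embedding network: convolutional layers
    followed by a fully-connected output layer."""

    def __init__(self, embedding_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool to a fixed-size feature map
        )
        self.fc = nn.Linear(64, embedding_dim)

    def forward(self, image_region: torch.Tensor) -> torch.Tensor:
        # image_region: [batch, 3, height, width]
        features = self.conv(image_region).flatten(start_dim=1)
        return self.fc(features)  # [batch, embedding_dim]
```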

The system 100 uses the shape embedding neural network 108 to generate a respective embedding 116 of each of multiple regions of the point cloud 104. An embedding of a point cloud region refers to a representation of the point cloud region as an ordered collection of numerical values, e.g., a vector of numerical values, e.g., a bit vector consisting of 0s and 1s. An embedding of a point cloud region can also be referred to as, e.g., a feature vector or a feature descriptor for the point cloud region. A region of a point cloud refers to a collection of data points from the point cloud, e.g., corresponding to 3D spatial positions that are enclosed in a cubical or spherical 3D geometrical region. For example, a region of a point cloud where each data point corresponds to a 3D spatial position defined by x-, y-, and z-coordinates may be given by the collection of data points corresponding to 3D spatial positions that are enclosed by the cubical region [14:18, 2:6, 44:48] (i.e., with x-coordinate value between 14 and 18, y-coordinate value between 2 and 6, and z-coordinate value between 44 and 48). An example region in the point cloud 104 is illustrated by 118.
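
Selecting the data points that fall within such a cubical region can be sketched as follows (assuming numpy and the [x, y, z, intensity] row layout from the earlier example):

```python
import numpy as np

def points_in_cube(point_cloud: np.ndarray, lo, hi) -> np.ndarray:
    """Return the data points whose 3D spatial positions lie inside the
    axis-aligned cube [lo, hi] (inclusive), e.g., lo=(14, 2, 44) and
    hi=(18, 6, 48) for the cubical region [14:18, 2:6, 44:48]."""
    lo, hi = np.asarray(lo), np.asarray(hi)
    positions = point_cloud[:, :3]
    mask = np.all((positions >= lo) & (positions <= hi), axis=1)
    return point_cloud[mask]
```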

The shape embedding neural network 108 may be configured to process the entire point cloud 104 to generate an output that defines a respective embedding 116 of each region of the point cloud 104. Alternatively, rather than processing the entire point cloud 104 at once, the shape embedding neural network 108 may be configured to process individual point cloud regions to generate respective embeddings of the point cloud regions.

Generally, the shape embedding neural network 108 can have any appropriate neural network architecture that enables it to generate embeddings of point cloud regions. For example, the shape embedding neural network 108 may have a PointNet architecture (i.e., as described with reference to: C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation”, 2017, The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)), a PointNet++ architecture (i.e., as described with reference to: C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space”, 2017, Advances in Neural Information Processing Systems (NIPS)), or a VoxNet architecture (i.e., as described with reference to: D. Maturana and S. Scherer, “VoxNet: A 3D Convolutional Neural Network for Real-Time Object Recognition”, 2015, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)).
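
The core idea behind a PointNet-style encoder can be sketched compactly (assuming PyTorch; this omits the transformation sub-networks of the published architectures, and all sizes are illustrative assumptions):

```python
import torch
from torch import nn

class ShapeEmbeddingNet(nn.Module):
    """PointNet-style sketch: a shared per-point MLP followed by a
    symmetric max-pooling over points, which makes the embedding
    invariant to the ordering of the data points in the region."""

    def __init__(self, in_dim: int = 4, embedding_dim: int = 128):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
        )
        self.fc = nn.Linear(256, embedding_dim)

    def forward(self, region: torch.Tensor) -> torch.Tensor:
        # region: [batch, num_points, in_dim], e.g., (x, y, z, intensity)
        per_point = self.point_mlp(region)    # [batch, num_points, 256]
        pooled = per_point.max(dim=1).values  # order-invariant pooling
        return self.fc(pooled)                # [batch, embedding_dim]
```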

The visual embedding neural network 106 and the shape embedding neural network 108 are jointly trained using an objective function that encourages two properties.

First, for an image region and a point cloud region that characterize the same area of an environment, the visual embedding neural network 106 and the shape embedding neural network 108 should generate respective embeddings that are “similar” (e.g., according to an appropriate numerical similarity measure).

Second, for an image region and a point cloud region that characterize different areas of an environment (or different environments altogether), the visual embedding neural network 106 and the shape embedding neural network 108 should generate respective embeddings that are “dissimilar” (e.g., according to an appropriate numerical similarity measure).

For example, the visual embedding neural network 106 and the shape embedding neural network 108 can be jointly trained using, e.g., a triplet loss objective function or a contrastive loss objective function.
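
A triplet loss of the kind referred to above can be sketched as follows (assuming PyTorch; the margin value is an assumed hyperparameter):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin: float = 0.5):
    """Pulls an image region embedding (anchor) toward the embedding
    of the point cloud region characterizing the same area (positive)
    and pushes it away from the embedding of a non-matching region
    (negative), up to the margin."""
    pos_dist = F.pairwise_distance(anchor, positive)
    neg_dist = F.pairwise_distance(anchor, negative)
    return torch.clamp(pos_dist - neg_dist + margin, min=0.0).mean()
```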

An example process for jointly training the visual embedding neural network 106 and the shape embedding neural network 108 is described in more detail with reference to FIG. 4.

The system 100 may generate respective embeddings of any appropriate number of image regions and point cloud regions. A few examples follow.

For example, the system 100 may generate respective embeddings for each image region in a grid of image regions that covers the image 102 and for each point cloud region in a grid of point cloud regions that covers the point cloud 104. In a particular example, the image regions may be composed of non-overlapping 2D rectangular image regions, where each pixel in the image 102 is included in exactly one of the image regions. In another particular example, the point cloud regions may be composed of non-overlapping 3D rectangular point cloud regions, where each data point in the point cloud 104 is included in exactly one of the point cloud regions.

As another example, the system may generate respective embeddings for each image region in a grid of image regions that covers a proper subset of the image 102 and for each point cloud region in a grid of point cloud regions that covers a proper subset of the point cloud 104. In a particular example, the image regions may be composed of non-overlapping 2D rectangular image regions, where each pixel that is included in an object depicted in the image 102 is included in exactly one of the image regions. In another particular example, the point cloud regions may be composed of non-overlapping 3D rectangular point cloud regions, where each data point that is included in an object characterized by the point cloud 104 is included in exactly one of the point cloud regions.

As another example, the system may generate embeddings only for regions in the image and the point cloud having an “interest score” that satisfies a predetermined threshold. The system may determine the interest score for an image region based on, e.g., the presence of edges, corners, blobs, ridges, or a combination thereof, in the image region. The system may determine the interest score for a point cloud region based on, e.g., the complexity of the point cloud region, e.g., a sum of residuals between: (i) the points included in the region, and (ii) a linear surface fitted to the points included in the region. Generally, the system may be configured to generate embeddings for image and point cloud regions that characterize unique features of the environment that can be effectively matched between the image and the point cloud. For example, the system may refrain from generating embeddings for regions of the image or the point cloud that correspond to flat road without road markings. As another example, the system may determine that embeddings should be generated for regions of the image and the point cloud that correspond to a portion of a vehicle or a pedestrian.
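
The plane-fit residual score for a point cloud region can be sketched as follows (assuming numpy; the least squares plane fit is one concrete reading of the “linear surface” described above):

```python
import numpy as np

def point_cloud_interest_score(points: np.ndarray) -> float:
    """Fit a plane z = a*x + b*y + c to the region's points by least
    squares and return the sum of absolute residuals. Flat regions
    (e.g., unmarked road) score near zero; geometrically complex
    regions (e.g., vehicles, pedestrians) score higher."""
    xy = points[:, :2]
    z = points[:, 2]
    design = np.column_stack([xy, np.ones(len(points))])
    coeffs, *_ = np.linalg.lstsq(design, z, rcond=None)
    residuals = np.abs(design @ coeffs - z)
    return float(residuals.sum())
```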

The matching engine 110 is configured to process the embeddings 112 of the image regions and the embeddings 116 of the point cloud regions to identify a set of embedding pairs 120. Each embedding pair 120 specifies: (i) an embedding 112 of an image region, and (ii) an embedding 116 of a point cloud region.

The matching engine 110 attempts to identify embedding pairs 120 in a manner that maximizes (or approximately maximizes) a similarity measure between the respective embeddings included in each embedding pair 120. The similarity measure may be, e.g., an L₂ similarity measure, a cosine similarity measure, or any other appropriate similarity measure.

In a particular example, the matching engine 110 may use a “greedy” nearest neighbor matching algorithm to sequentially match each image region embedding 112 to a respective point cloud region embedding 116. For each given image region embedding 112, the matching engine 110 identifies the corresponding point cloud region embedding 116 which is most similar (i.e., according to a similarity measure) to the given image region embedding from among the currently unmatched point cloud region embeddings. The greedy nearest neighbor matching algorithm may terminate when each image region embedding is matched to a corresponding point cloud region embedding, or when no unmatched point cloud region embeddings remain.
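
This greedy procedure can be sketched as follows (assuming numpy, with L2 distance standing in for the similarity measure):

```python
import numpy as np

def greedy_match(image_embs: np.ndarray, cloud_embs: np.ndarray):
    """Sequentially match each image region embedding to the most
    similar (lowest L2 distance) currently unmatched point cloud
    region embedding, terminating when either side is exhausted."""
    unmatched = set(range(len(cloud_embs)))
    pairs = []
    for i, emb in enumerate(image_embs):
        if not unmatched:
            break
        candidates = list(unmatched)
        dists = np.linalg.norm(cloud_embs[candidates] - emb, axis=1)
        j = candidates[int(np.argmin(dists))]
        pairs.append((i, j))
        unmatched.remove(j)
    return pairs  # (image region index, point cloud region index) pairs
```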

More generally, the matching engine 110 can use any appropriate matching algorithm to identify the embedding pairs 120. Some examples of nearest neighbor matching algorithms are described with reference to, e.g.: M. Muja, D. G. Lowe, “Fast approximate nearest neighbors with automatic algorithm configuration”, 2009, VISAPP International Conference on Computer Vision Theory and Applications.

The system 100 uses each embedding pair 120 identified by the matching engine 110 to identify a respective region pair 122 that specifies the image region and the point cloud region corresponding to the embedding pair 120. In this manner, the system 100 identifies a set of region pairs 122 that each specify an image region and a point cloud region that are predicted to characterize the same area of the environment. That is, the region pairs 122 define a mapping between respective regions of the image 102 and the point cloud 104 that are predicted to characterize the same area of the environment.

For example, a region pair 122 may specify a region of the image 102 and a region of the point cloud 104 that are both predicted to characterize the same object or the same part of the same object in the environment. Objects in the environment may be, e.g., people, animals, cars, road signs, and the like.

The mapping between corresponding image regions and point cloud regions that is defined by the region pairs 122 can be used for any of a variety of purposes. A few examples of using the cross-modal alignment system 100 in an on-board system of a vehicle are described with reference to FIG. 3.

FIG. 2 is an illustration 200 of an example of a mapping between respective regions of an image 202 and a point cloud 204 (e.g., of the sort that can be determined by the cross-modal alignment system 100 described with reference to FIG. 1). In this example, the image regions 206-A and 206-B are respectively mapped to the point cloud regions 208-A and 208-B (and vice versa). That is, the image region 206-A and the point cloud region 208-A form a first “region pair”, and the image region 206-B and the point cloud region 208-B form a second “region pair” (as described with reference to FIG. 1).

FIG. 3 is a block diagram of an example on-board system 300 of a vehicle 302. The on-board system 300 is composed of hardware and software components, some or all of which are physically located on-board the vehicle 302. As will be described in more detail below, the on-board system 300 can use the alignment system 100 (as described with reference to FIG. 1) for any of a variety of purposes.

In some cases, the on-board system 300 can make fully-autonomous or partly-autonomous driving decisions (i.e., driving decisions taken independently of the driver of the vehicle 302), present information to the driver of the vehicle 302 to assist the driver in operating the vehicle safely, or both. For example, in response to determining that another vehicle is unlikely to yield for the vehicle 302, the on-board system 300 may autonomously apply the brakes of the vehicle 302 or otherwise autonomously change the trajectory of the vehicle 302 to prevent a collision between the vehicle 302 and the other vehicle. As another example, in response to determining that another vehicle is unlikely to yield for the vehicle 302, the on-board system 300 may present an alert message to the driver of the vehicle 302 with instructions to adjust the trajectory of the vehicle 302 to avoid a collision with the other vehicle.

Although the vehicle 302 in FIG. 3 is depicted as an automobile, and the examples in this document are described with reference to automobiles, in general the vehicle 302 can be any kind of vehicle. For example, besides an automobile, the vehicle 302 can be a watercraft or an aircraft. Moreover, the on-board system 300 can include components additional to those depicted in FIG. 3 (e.g., a collision detection system or a navigation system).

The on-board system 300 includes a sensor system 304 that enables the on-board system 300 to “see” the environment in the vicinity of the vehicle 302. More specifically, the sensor system 304 includes sensors of multiple different modalities, in particular, camera sensors and surveying sensors (e.g., lidar sensors, radar sensors, or both).

The sensor system 304 continually (i.e., at each of multiple time points) generates images 306 characterizing the visual appearance of the environment in the vicinity of the vehicle and point clouds 308 characterizing the 3D geometry of the environment in the vicinity of the vehicle.

The alignment system 100 can process an image 306 and a point cloud 308 generated by the sensor system 304 to determine a mapping between respective regions of the image 306 and the point cloud 308 that are predicted to characterize the same area of the environment. The output of the alignment system 100 can be used by any of a variety of other systems on-board the vehicle 302, e.g., a calibration system 310, a localization system 312, and a prediction system 314, as will be described in more detail below.

The calibration system 310 is configured to maintain calibration data that characterizes the positions and orientations of some or all of the sensors mounted on the vehicle 302. For example, for each sensor mounted on the vehicle 302, the calibration system 310 may maintain calibration data that includes: (i) a 3D vector defining x-, y-, and z-coordinates of the position of the sensor on the vehicle, and (ii) a 3D vector defining x-, y-, and z-coordinates of the orientation of the sensor (i.e., the direction the sensor is pointing).

The calibration system 310 can continually (i.e., at each of multiple time points) check the current accuracy of the calibration data. The calibration data may become inaccurate over time due to changes in the positions and orientations of the sensors. The positions and orientations of the sensors can change over time, e.g., due to temperature variations causing slight deformations to the portion of the vehicle where the sensor is mounted, due to objects (e.g., tree branches) brushing the sensors, or due to abrupt changes in the speed of the vehicle.

The calibration system 310 can use the output of the alignment system 100 to cross-check the accuracy of the current calibration data for a camera sensor and a surveying sensor (e.g., lidar or radar sensor). For example, the calibration system 310 can use the output of the alignment system to determine the parameters of a transformation (e.g., a translation and rotation transformation) that aligns the center positions of matching regions of the image (generated by the camera sensor) and the point cloud (generated by the surveying sensor). A region of the image is said to “match” a region of the point cloud if the output of the alignment system indicates that they correspond to the same area of the environment. The calibration system 310 can determine the parameters of the transformation using any appropriate fitting method, e.g., a least squares fitting method with random sample consensus (RANSAC), or a robustified non-linear least squares fitting method.
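
For a rigid (translation and rotation) transformation, the least squares fit between matched center positions has a closed-form SVD solution; a sketch follows (assuming numpy; a RANSAC variant would repeatedly run this fit on random subsets of the matches and keep the fit with the most inliers):

```python
import numpy as np

def fit_rigid_transform(src: np.ndarray, dst: np.ndarray):
    """Least squares rotation and translation mapping the Nx3 points
    `src` onto the Nx3 points `dst` (the standard Kabsch solution)."""
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    covariance = (src - src_mean).T @ (dst - dst_mean)
    u, _, vt = np.linalg.svd(covariance)
    d = np.sign(np.linalg.det(vt.T @ u.T))     # guard against reflections
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    translation = dst_mean - rotation @ src_mean
    return rotation, translation
```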

After determining the parameters of the transformation from the image to the point cloud (or vice versa), the calibration system 310 can apply the transformation to the calibration parameters characterizing the position and orientation of the camera sensor. The results of applying the transformation to the calibration parameters for the camera sensor define an estimate for the calibration parameters characterizing the position and orientation of the surveying sensor. In response to determining that the estimate for the calibration parameters of the surveying sensor differs by at least a threshold amount from the maintained calibration parameters of the surveying sensor, the calibration system may determine that one or both of the surveying sensor and the camera sensor are miscalibrated. In response to determining that sensors are miscalibrated, the on-board system may, e.g., alert the driver of the vehicle, or cause the vehicle to pull over.

The localization system 312 can process an image 306, a point cloud 308, and data defining the position of an object in sensor data of one of the modalities (i.e., either the image 306 or the point cloud 308), to determine the position of the same object in the sensor data of the other modality.

For example, the localization system 312 can process an image 306, a point cloud 308, and data defining the position of an object in the image 306, to generate data defining the position of the same object in the point cloud 308.

As another example, the localization system 312 can process an image 306, a point cloud 308, and data defining the position of an object in the point cloud 308, to generate data defining the position of the same object in the image 306.

The position of an object in an image or in a point cloud can be represented in any appropriate manner. For example, the position of an object in an image can be represented by a 2D bounding box that encloses the object in the image. As another example, the position of an object in a point cloud can be represented by a 3D bounding box that encloses the object in the point cloud.

The localization system 312 can use the alignment system 100 to identify the position of an object in a point cloud based on the position of the object in an image. For example, the localization system 312 can use the alignment system 100 to generate: (i) embeddings of one or more regions of the image that cover the object in the image, and (ii) embeddings of a grid of regions in the point cloud that cover the entire point cloud.

The localization system 312 can map the image regions that cover the object in the image to corresponding point cloud regions, e.g., by matching the embeddings of the image regions to corresponding embeddings of point cloud regions, e.g., as described with reference to FIG. 1. Thereafter, the localization system 312 can determine the position of the object in the point cloud based on the point cloud regions that are mapped onto by the image regions that cover the object in the image.

Similarly, the localization system 312 can also use the alignment system 100 to identify the position of an object in an image based on the position of the object in a point cloud.

The on-board system 300 can use the localization system 312 in any of a variety of circumstances. For example, the on-board system 300 may track another vehicle using camera sensor data while the other vehicle is out of range of the lidar sensor of the vehicle 302. Once the other vehicle comes into range of the lidar sensor of the vehicle 302, the on-board system 300 can use the localization system 312 to determine the position of the other vehicle in the point cloud data generated by the lidar sensor. Having localized the other vehicle in both the camera sensor data and the lidar sensor data, the on-board system 300 can use sensor data of both modalities, e.g., to predict the behavior of the other vehicle.

The prediction system 314 is configured to process images 306 and point clouds 308 generated by the sensor system 304, e.g., to detect and identify objects (e.g., vehicles, pedestrians, road signs, and the like) in the vicinity of the vehicle 302. The prediction system 314 may use the alignment system 100 to align image data and point cloud data generated by the sensor system before processing it, e.g., using one or more neural networks.

In one example, the prediction system 314 may use the alignment system 100 to generate a mapping between respective regions of an image 306 and a point cloud 308 that are predicted to characterize the same areas of the environment. Thereafter, the prediction system 314 may use the mapping to project the point cloud 308 onto a 2D image plane that is aligned with the image 306, and then provide the projected point cloud 308 and the image 306 to an object detection neural network.
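
The projection step can be sketched with a pinhole camera model (an assumption for illustration; this presumes the points have already been transformed into the camera frame using the alignment, and that the camera intrinsics focal, cx, and cy are known):

```python
import numpy as np

def project_to_image_plane(points: np.ndarray, focal: float,
                           cx: float, cy: float) -> np.ndarray:
    """Project Nx3 camera-frame points (z > 0 in front of the camera)
    onto the 2D image plane, returning Nx2 pixel coordinates."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    u = focal * x / z + cx
    v = focal * y / z + cy
    return np.stack([u, v], axis=1)
```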

In some cases, data characterizing the relative positions and orientations of a camera sensor and a surveying sensor can be used to approximately align the images generated by the camera sensor and the point clouds generated by the surveying sensor without using the alignment system 100. However, aligning images and point clouds based on the relative positions of the camera sensor and surveying sensor that generated them can be inaccurate, particularly when the vehicle is in motion. More specifically, the camera sensor and the surveying sensor often generate data at different time points (e.g., 0.2 seconds apart). In the duration of time that elapses between when the camera sensor generates an image and when the surveying sensor generates a point cloud, the vehicle can move relative to the environment. In this situation, attempting to align the image and the point cloud based on the relative positions and orientations of the camera and surveying sensors can result in an inaccurate alignment. On the other hand, the alignment system 100 can accurately align images and point clouds when the vehicle 302 is in motion, even if the relative positions and orientations of the surveying and camera sensors are inaccurate or unknown.

In addition to being used by an on-board system of a vehicle (e.g., as described with reference to FIG. 3), the alignment system can be used in a variety of other applications. For example, the alignment system can be used to generate training data for training an object detection neural network that is configured to process a point cloud to generate an output that identifies the positions of objects in the point cloud. Manually segmenting objects in point clouds for use as training data for training the object detection neural network may be difficult, time-consuming, and expensive. The alignment system can obviate these challenges, since it can be used to transfer segmentations from images (which can be readily obtained) onto corresponding point cloud data, which can subsequently be used to train the object detection neural network.

FIG. 4 illustrates an example data flow 400 for jointly training the visual embedding neural network 106 and the shape embedding neural network 108. Training the visual embedding neural network 106 and the shape embedding neural network 108 refers to determining trained values of their respective model parameters 402.

The visual embedding neural network 106 and the shape embedding neural network 108 are trained on a set of training data 404 that includes multiple training examples. Each training example includes a region of an image and a region of a point cloud. Some of the training examples are “positive” training examples, where the image region and the point cloud region characterize the same area of an environment. The remainder of the training examples are “negative” training examples, where the image region and the point cloud region characterize different areas of an environment (or different environments altogether).

The training examples of the training data 404 can be generated in any of a variety of ways.

For example, positive training examples can be generated by using a camera sensor and a surveying sensor that have known positions and orientations relative to one another to simultaneously capture an image and a point cloud characterizing an environment. The relative positions and orientations of the sensors can be used to align the image and the point cloud, and one or more training examples can be generated by extracting pairs of corresponding regions from the aligned sensor data.

As another example, positive training examples can be manually generated by human annotation, where a person manually annotates corresponding image regions and point cloud regions that characterize the same area of the environment.

As another example, negative training examples can be generated by randomly pairing image regions and point cloud regions characterizing areas of different environments.

At each of multiple training iterations, a “batch” (i.e., set) of one or more training examples 406 are selected (e.g., randomly) from the training data 404.

For each training example 406 in the batch, the visual embedding neural network 106 processes the image region 408 from the training example 406 in accordance with current values of the model parameters 402 to generate an embedding 412 of the image region 408. Similarly, the shape embedding neural network 108 processes the point cloud region 410 from the training example 406 in accordance with current values of the model parameters 402 to generate an embedding 414 of the point cloud region 410.

The embeddings of the image regions 408 and the point cloud regions 410 from the training examples 406 of the current batch are used to evaluate an objective function 416. Gradients of the objective function 416 are computed (e.g., using backpropagation), and are thereafter used to update the current values of the model parameters 402 (e.g., using an RMSprop or Adam gradient descent optimization procedure).
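
One such training iteration can be sketched as follows (assuming PyTorch and the embedding network sketches from earlier; a contrastive loss is shown, with labels of 1 for positive training examples and 0 for negative ones, and the learning rate and margin are assumed hyperparameters):

```python
import torch
import torch.nn.functional as F

visual_net = VisualEmbeddingNet()   # sketched earlier
shape_net = ShapeEmbeddingNet()     # sketched earlier
params = list(visual_net.parameters()) + list(shape_net.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

def train_step(image_regions, cloud_regions, labels, margin=0.5):
    image_embs = visual_net(image_regions)
    cloud_embs = shape_net(cloud_regions)
    # Contrastive loss: pull matching embeddings together, push
    # non-matching embeddings at least `margin` apart.
    dists = F.pairwise_distance(image_embs, cloud_embs)
    loss = (labels * dists.pow(2)
            + (1 - labels) * torch.clamp(margin - dists, min=0).pow(2)).mean()
    optimizer.zero_grad()
    loss.backward()    # gradients via backpropagation
    optimizer.step()   # update the current values of the model parameters
    return loss.item()
```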

As described earlier, the objective function 416 broadly encourages the visual embedding neural network 106 and the shape embedding neural network 108 to generate similar embeddings of image regions and point cloud regions if and only if they characterize the same area of an environment. For example, the objective function may be a triplet loss objective function or a contrastive loss objective function.

The model parameters 402 of the visual embedding neural network 106 and the shape embedding neural network 108 may be trained until a training termination criterion is satisfied, e.g., when a predetermined number of training iterations have been performed. The trained values of the model parameters 402 may be transmitted to an on-board system of a vehicle (e.g., as described with reference to FIG. 3) over any appropriate wired or wireless connection.

FIG. 5 is a flow diagram of an example process 500 for determining a mapping between respective regions of an image and a point cloud that characterize the same area of an environment. For convenience, the process 500 will be described as being performed by a system of one or more computers located in one or more locations. For example, a cross-modal alignment system, e.g., the cross-modal alignment system 100 of FIG. 1, appropriately programmed in accordance with this specification, can perform the process 500.

The system obtains an image and a point cloud (502). The image is generated by a camera sensor and characterizes a visual appearance of the environment. The point cloud is generated by a surveying sensor and characterizes a 3D geometry of the environment. The image can be represented in any appropriate format, e.g., as a black and white image or as a color image. The point cloud can be represented as a collection of data points, where each data point defines a respective 3D spatial position of a point on a surface in the environment. Optionally, each data point in the point cloud may include additional “intensity” information that characterizes, e.g., the reflectivity, texture, or density of the material at the 3D spatial position in the environment corresponding to the data point.

The system processes each of multiple regions of the image using a visual embedding neural network to generate a respective embedding of each of the image regions (504). Each image region may correspond to a portion of the image enclosed by a 2D bounding region of any appropriate shape, e.g., a 2D bounding box. The system may generate respective embeddings for each image region in a grid of image regions that covers the entire image, or for each image region in a set of image regions that covers a portion of the image (e.g., a portion of the image that depicts an object).

The system processes each of multiple regions of the point cloud using a shape embedding neural network to generate a respective embedding of each of the point cloud regions (506). Each point cloud region may correspond to a portion of the point cloud (i.e., a set of data points in the point cloud) corresponding to spatial positions enclosed by a 3D spatial bounding region of any appropriate shape, e.g., a 3D bounding box. The system may generate respective embeddings for each point cloud region in a grid of point cloud regions that covers the entire point cloud, or for each point cloud region in a set of point cloud regions that covers a portion of the point cloud (e.g., a portion of the point cloud that corresponds to an object).

The system identifies a set of multiple region pairs using the embeddings of the image regions and the embeddings of the point cloud regions (508). Each region pair specifies an image region and a point cloud region that characterize the same area of the environment. To identify the region pairs, the system uses a matching algorithm (e.g., a nearest neighbor matching algorithm) to identify a set of multiple embedding pairs, each of which specifies an embedding of an image region and an embedding of a point cloud region. The system uses each embedding pair to identify a respective region pair that specifies the image region and the point cloud region corresponding to the embedding pair. The region pairs define a mapping between respective regions of the image and the point cloud that are predicted to characterize the same area of the environment.

This specification uses the term “configured” in connection with systems and computer program components. For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.

Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or in addition, the program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.

The term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

A computer program, which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.

In this specification the term “engine” is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more specific functions. Generally, an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers.

The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers.

Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.

Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.

To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's device in response to requests received from the web browser. Also, a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.

Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads.

Machine learning models can be implemented and deployed using a machine learning framework, e.g., a TensorFlow framework, a Microsoft Cognitive Toolkit framework, an Apache Singa framework, or an Apache MXNet framework.

Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.

The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client. Data generated at the user device, e.g., a result of the user interaction, can be received at the server from the device.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

1-20. (canceled)
21. A method performed by one or more data processing apparatus for aligning multi-modal sensor data, the method comprising: obtaining multi-modal sensor data characterizing an environment, wherein the multi-modal sensor data comprises: (i) first sensor data generated by a first sensor modality, and (ii) second sensor data generated by a second sensor modality, wherein the second sensor modality is different than the first sensor modality; processing each of a plurality of regions of the first sensor data using a first embedding neural network that is specific to the first sensor modality to generate a respective region embedding of each of the plurality of regions of the first sensor data; processing each of a plurality of regions of the second sensor data using a second embedding neural network that is specific to the second sensor modality to generate a respective region embedding of each of the plurality of regions of the second sensor data; determining a plurality of similarity scores, wherein each similarity score measures a similarity between a region embedding of a respective region of the first sensor data and a region embedding of a respective region of the second sensor data; and identifying a plurality of region embedding pairs that collectively define an alignment of the first sensor data and the second sensor data based on the plurality of similarity scores, wherein each region embedding pair comprises a region embedding of a respective region of the first sensor data and a region embedding of a respective region of the second sensor data.
22. The method of claim 21, wherein the first sensor modality is an imaging modality and the first sensor data comprises an image that characterizes a visual appearance of the environment.
23. The method of claim 21, wherein the second sensor modality is a surveying sensor modality and the second sensor data comprises a point cloud, wherein the point cloud includes a collection of data points that characterize a three-dimensional geometry of the environment, wherein each data point defines a respective three-dimensional spatial position of a point on a surface in the environment.
24. The method of claim 23, wherein the second sensor data is captured by a lidar sensor or a radar sensor.
25. The method of claim 24, wherein the second sensor data is captured by a lidar sensor, and each data point in the point cloud additionally defines a strength of a reflection of a pulse of light that was transmitted by the lidar sensor and that reflected from the point on the surface of the environment at the three-dimensional spatial position defined by the data point.
 26. The method of claim 21, wherein the first sensor data and the second sensor data are captured by sensors mounted on a vehicle.
27. The method of claim 21, further comprising: using the alignment of the first sensor data and the second sensor data to determine whether a first sensor that captured the first sensor data and a second sensor that captured the second sensor data are accurately calibrated.
28. The method of claim 21, further comprising: obtaining data defining a position of an object in the first sensor data; and identifying a corresponding position of the object in the second sensor data based on: (i) the position of the object in the first sensor data, and (ii) the alignment of the first sensor data and the second sensor data.
29. The method of claim 21, further comprising: generating fused sensor data by fusing the first sensor data and the second sensor data using the alignment of the first sensor data and the second sensor data; and processing the fused sensor data using a neural network to generate a neural network output.
30. The method of claim 29, wherein the neural network output comprises data identifying positions of objects in the environment.
31. The method of claim 21, wherein the plurality of regions of the first sensor data cover the first sensor data.
32. The method of claim 21, wherein the plurality of regions of the second sensor data cover the second sensor data.
33. The method of claim 21, wherein the plurality of region embedding pairs are identified using a greedy nearest neighbor matching algorithm.
34. The method of claim 21, wherein the first embedding neural network and the second embedding neural network are jointly trained using a triplet loss objective function or a contrastive loss objective function.
35. One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations for aligning multi-modal sensor data, the operations comprising: obtaining multi-modal sensor data characterizing an environment, wherein the multi-modal sensor data comprises: (i) first sensor data generated by a first sensor modality, and (ii) second sensor data generated by a second sensor modality, wherein the second sensor modality is different than the first sensor modality; processing each of a plurality of regions of the first sensor data using a first embedding neural network that is specific to the first sensor modality to generate a respective region embedding of each of the plurality of regions of the first sensor data; processing each of a plurality of regions of the second sensor data using a second embedding neural network that is specific to the second sensor modality to generate a respective region embedding of each of the plurality of regions of the second sensor data; determining a plurality of similarity scores, wherein each similarity score measures a similarity between a region embedding of a respective region of the first sensor data and a region embedding of a respective region of the second sensor data; and identifying a plurality of region embedding pairs that collectively define an alignment of the first sensor data and the second sensor data based on the plurality of similarity scores, wherein each region embedding pair comprises a region embedding of a respective region of the first sensor data and a region embedding of a respective region of the second sensor data.
36. The non-transitory computer storage media of claim 35, wherein the first sensor modality is an imaging modality and the first sensor data comprises an image that characterizes a visual appearance of the environment.
37. The non-transitory computer storage media of claim 35, wherein the second sensor modality is a surveying sensor modality and the second sensor data comprises a point cloud, wherein the point cloud includes a collection of data points that characterize a three-dimensional geometry of the environment, wherein each data point defines a respective three-dimensional spatial position of a point on a surface in the environment.
38. A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations for aligning multi-modal sensor data, the operations comprising: obtaining multi-modal sensor data characterizing an environment, wherein the multi-modal sensor data comprises: (i) first sensor data generated by a first sensor modality, and (ii) second sensor data generated by a second sensor modality, wherein the second sensor modality is different than the first sensor modality; processing each of a plurality of regions of the first sensor data using a first embedding neural network that is specific to the first sensor modality to generate a respective region embedding of each of the plurality of regions of the first sensor data; processing each of a plurality of regions of the second sensor data using a second embedding neural network that is specific to the second sensor modality to generate a respective region embedding of each of the plurality of regions of the second sensor data; determining a plurality of similarity scores, wherein each similarity score measures a similarity between a region embedding of a respective region of the first sensor data and a region embedding of a respective region of the second sensor data; and identifying a plurality of region embedding pairs that collectively define an alignment of the first sensor data and the second sensor data based on the plurality of similarity scores, wherein each region embedding pair comprises a region embedding of a respective region of the first sensor data and a region embedding of a respective region of the second sensor data.
39. The system of claim 38, wherein the first sensor modality is an imaging modality and the first sensor data comprises an image that characterizes a visual appearance of the environment.
40. The system of claim 38, wherein the second sensor modality is a surveying sensor modality and the second sensor data comprises a point cloud, wherein the point cloud includes a collection of data points that characterize a three-dimensional geometry of the environment, wherein each data point defines a respective three-dimensional spatial position of a point on a surface in the environment.
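The following sketches are provided for explanatory purposes only and do not limit the claims. As a minimal sketch of the method of claim 21, assuming cosine similarity as the similarity measure and hypothetical callables embed_first and embed_second standing in for the two trained modality-specific embedding neural networks:

    import numpy as np

    def align_regions(first_regions, second_regions,
                      embed_first, embed_second, threshold=0.5):
        # Embed each region with its modality-specific embedding
        # network; each callable maps a region to a 1-D vector.
        a = np.stack([embed_first(r) for r in first_regions])    # (N, D)
        b = np.stack([embed_second(r) for r in second_regions])  # (M, D)

        # Similarity scores: cosine similarity between every
        # cross-modal pair of region embeddings.
        a = a / np.linalg.norm(a, axis=1, keepdims=True)
        b = b / np.linalg.norm(b, axis=1, keepdims=True)
        scores = a @ b.T                                          # (N, M)

        # Identify region embedding pairs: keep, for each region of the
        # first sensor data, its best-scoring partner above a threshold
        # (the threshold is an assumption of this illustration).
        pairs = [(i, int(scores[i].argmax()))
                 for i in range(scores.shape[0])
                 if scores[i].max() >= threshold]
        return pairs, scores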
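A minimal sketch of the fusion step of claim 29, under the assumption that fusion is implemented as per-pair feature concatenation; the claim itself does not prescribe any particular fusion scheme:

    import numpy as np

    def fuse_aligned_features(first_feats, second_feats, pairs):
        # first_feats / second_feats: per-region feature vectors for the
        # two modalities; pairs: (i, j) index pairs from the alignment.
        # Concatenating the features of each aligned pair yields fused
        # features a downstream network (e.g., an object detector,
        # per claim 30) can consume.
        return np.stack([np.concatenate([first_feats[i], second_feats[j]])
                         for i, j in pairs])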
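One common reading of the greedy nearest neighbor matching recited in claim 33, sketched under the assumption that the similarity-score matrix has already been computed:

    import numpy as np

    def greedy_nearest_neighbor_match(scores):
        # scores: (N, M) matrix of cross-modal similarity scores.
        # Repeatedly take the highest-scoring pair whose regions are
        # both still unmatched, until one side is exhausted.
        scores = scores.astype(float).copy()
        pairs = []
        for _ in range(min(scores.shape)):
            i, j = np.unravel_index(np.argmax(scores), scores.shape)
            pairs.append((int(i), int(j)))
            scores[i, :] = -np.inf  # mark region i as matched
            scores[:, j] = -np.inf  # mark region j as matched
        return pairs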
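A minimal sketch of the triplet loss objective recited in claim 34, assuming Euclidean distance between embeddings and an illustrative margin value; the contrastive loss alternative is analogous:

    import numpy as np

    def triplet_loss(anchor, positive, negative, margin=0.2):
        # anchor: embedding of a region of the first sensor data;
        # positive: embedding of the corresponding region of the second
        # sensor data; negative: embedding of a non-corresponding
        # region. The loss pushes the anchor closer to the positive
        # than to the negative by at least the margin (the margin
        # value here is an assumption).
        d_pos = np.linalg.norm(anchor - positive)
        d_neg = np.linalg.norm(anchor - negative)
        return max(0.0, d_pos - d_neg + margin)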